Optimal and fast detection of spatial clusters with scan statistics
We consider the detection of multivariate spatial clusters in the Bernoulli
model with $N$ locations, where the design distribution has weakly dependent
marginals. The locations are scanned with a rectangular window with sides
parallel to the axes and with varying sizes and aspect ratios. Multivariate
scan statistics pose a statistical problem due to the multiple testing over
many scan windows, as well as a computational problem because statistics have
to be evaluated on many windows. This paper introduces methodology that leads
to both statistically optimal inference and computationally efficient
algorithms. The main difference from the traditional calibration of scan
statistics is the concept of grouping scan windows according to their sizes,
and then applying different critical values to different groups. It is shown
that this calibration of the scan statistic results in optimal inference for
spatial clusters on both small scales and on large scales, as well as in the
case where the cluster lives on one of the marginals. Methodology is introduced
that allows for an efficient approximation of the set of all rectangles while
still guaranteeing the statistical optimality results described above. It is
shown that the resulting scan statistic has a computational complexity that is
almost linear in the number of locations $N$.
Comment: Published at http://dx.doi.org/10.1214/09-AOS732 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
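The idea of grouping scan windows by size can be illustrated with a minimal sketch. It is not the paper's algorithm: it enumerates every axis-parallel rectangle on a small grid exhaustively (so it is far from almost linear), it assumes a known baseline rate `p0`, and it groups windows by the dyadic level of their area, which is an assumption made here for illustration. The function name `scan_by_size_group` is hypothetical. The sketch only reports the largest standardized count per size group; the group-specific critical values would still have to be calibrated, for example by simulation under the null.

```python
import math

def scan_by_size_group(grid, p0):
    """Return {level: (z, (r0, c0, r1, c1))} with the largest standardized
    count among rectangles whose area lies in [2**level, 2**(level + 1)).

    grid is a list of rows of 0/1 values; p0 is the assumed baseline rate.
    Rectangles are half-open: rows r0..r1-1 and columns c0..c1-1.
    """
    rows, cols = len(grid), len(grid[0])
    # Summed-area table so that each rectangle count costs O(1).
    S = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            S[r + 1][c + 1] = grid[r][c] + S[r][c + 1] + S[r + 1][c] - S[r][c]

    best = {}
    for r0 in range(rows):
        for r1 in range(r0 + 1, rows + 1):
            for c0 in range(cols):
                for c1 in range(c0 + 1, cols + 1):
                    area = (r1 - r0) * (c1 - c0)
                    count = S[r1][c1] - S[r0][c1] - S[r1][c0] + S[r0][c0]
                    # Standardized excess of ones over the baseline.
                    z = (count - area * p0) / math.sqrt(area * p0 * (1 - p0))
                    level = area.bit_length() - 1  # floor(log2(area))
                    if level not in best or z > best[level][0]:
                        best[level] = (z, (r0, c0, r1, c1))
    return best

# Toy input: a 12 x 12 grid of zeros with a 4 x 4 block of ones.
grid = [[1 if 4 <= r < 8 and 4 <= c < 8 else 0 for c in range(12)]
        for r in range(12)]
for level, (z, rect) in sorted(scan_by_size_group(grid, 0.1).items()):
    print(level, round(z, 2), rect)
```

Keeping one maximum per size group, rather than a single global maximum, is what makes it possible to compare small and large windows against different thresholds.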
Forward stagewise regression and the monotone lasso
We consider the least angle regression and forward stagewise algorithms for
solving penalized least squares regression problems. In Efron, Hastie,
Johnstone & Tibshirani (2004) it is proved that the least angle regression
algorithm, with a small modification, solves the lasso regression problem. Here
we give an analogous result for incremental forward stagewise regression,
showing that it solves a version of the lasso problem that enforces
monotonicity. One consequence of this is as follows: while lasso makes optimal
progress in terms of reducing the residual sum-of-squares per unit increase in
$L_1$-norm of the coefficient vector $\beta$, forward stagewise is optimal per unit
$L_1$ arc-length traveled along the coefficient path. We also study a condition
under which the coefficient paths of the lasso are monotone, and hence the
different algorithms coincide. Finally, we compare the lasso and forward
stagewise procedures in a simulation study involving a large number of
correlated predictors.
Comment: Published at http://dx.doi.org/10.1214/07-EJS004 in the Electronic
Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
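Incremental forward stagewise regression, as it is commonly described, can be sketched in a few lines. The version below is a minimal illustration, assuming predictors on comparable scales, a fixed step size `eps`, and a fixed number of steps; the function names, the step size, and the simulated data are choices made here, not details taken from the paper.

```python
import random

def forward_stagewise(X, y, eps=0.01, n_steps=1000):
    """Incremental forward stagewise regression.

    X is a list of rows, y a list of responses. At each step the coefficient
    of the predictor most correlated with the current residual moves by eps
    in the direction of that correlation.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)
    for _ in range(n_steps):
        # Inner product of every predictor with the current residual.
        corr = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda k: abs(corr[k]))
        if corr[j] == 0.0:
            break  # residual is orthogonal to all predictors
        step = eps if corr[j] > 0 else -eps
        beta[j] += step
        for i in range(n):
            residual[i] -= step * X[i][j]
    return beta

def rss(X, y, beta):
    """Residual sum of squares of the linear fit X @ beta."""
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

# Simulated data (an assumption for illustration): two active predictors.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(100)]
y = [2 * row[0] - row[2] + 0.1 * random.gauss(0, 1) for row in X]
beta_hat = forward_stagewise(X, y)
print([round(b, 2) for b in beta_hat])
```

Each update changes a single coefficient by a fixed amount, so the number of steps taken is proportional to the arc-length traveled along the coefficient path.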
Beta-trees: Multivariate histograms with confidence statements
Multivariate histograms are difficult to construct due to the curse of
dimensionality. Motivated by $k$-d trees in computer science, we show how to
construct an efficient data-adaptive partition of Euclidean space that
possesses the following two properties: With high confidence the distribution
from which the data are generated is close to uniform on each rectangle of the
partition; and despite the data-dependent construction we can give guaranteed
finite sample simultaneous confidence intervals for the probabilities (and
hence for the average densities) of each rectangle in the partition. This
partition will automatically adapt to the sizes of the regions where the
distribution is close to uniform. The methodology produces confidence intervals
whose widths depend only on the probability content of the rectangles and not
on the dimensionality of the space, thus avoiding the curse of dimensionality.
Moreover, the widths essentially match the optimal widths in the univariate
setting. The simultaneous validity of the confidence intervals allows us to use
this construction, which we call Beta-trees, for various data-analytic
purposes. We illustrate this by using Beta-trees for visualizing data and for
multivariate mode-hunting.
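The $k$-d tree motivation can be illustrated with a minimal sketch of a data-adaptive partition built by recursive median splits. This is not the Beta-tree construction itself: the stopping rule (`min_points`), the cycling order of the coordinates, and the function names are assumptions made here, and the simultaneous confidence intervals described in the abstract are not reproduced. The sketch only reports the empirical average density of each rectangle.

```python
import random

def kd_partition(points, bounds, min_points=8, depth=0):
    """Recursively split at the sample median, cycling through coordinates.

    points is a list of tuples, bounds a tuple of (lo, hi) pairs, one per
    coordinate. Returns a list of (bounds, count) leaves.
    """
    if len(points) < 2 * min_points:
        return [(bounds, len(points))]
    axis = depth % len(bounds)
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    cut = ordered[mid][axis]  # the split passes through a data point
    left_bounds, right_bounds = list(bounds), list(bounds)
    left_bounds[axis] = (bounds[axis][0], cut)
    right_bounds[axis] = (cut, bounds[axis][1])
    return (kd_partition(ordered[:mid], tuple(left_bounds), min_points, depth + 1)
            + kd_partition(ordered[mid:], tuple(right_bounds), min_points, depth + 1))

def volume(bounds):
    """Volume of an axis-parallel rectangle."""
    v = 1.0
    for lo, hi in bounds:
        v *= hi - lo
    return v

# Simulated data (an assumption for illustration): uniform on the unit square.
random.seed(0)
pts = [(random.random(), random.random()) for _ in range(256)]
leaves = kd_partition(pts, ((0.0, 1.0), (0.0, 1.0)))
for bounds, count in leaves[:3]:
    # Empirical average density: probability content divided by volume.
    print(bounds, count, (count / len(pts)) / volume(bounds))
```

Because every split passes through an order statistic, all leaves hold nearly the same number of points, so small rectangles appear where the data are dense and large ones where they are sparse.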
Large-scale inference with block structure
The detection of weak and rare effects in large amounts of data arises in a
number of modern data analysis problems. Known results show that in this
situation the potential of statistical inference is severely limited by the
large-scale multiple testing that is inherent in these problems. Here we show
that fundamentally more powerful statistical inference is possible when there
is some structure in the signal that can be exploited, e.g. if the signal is
clustered in many small blocks, as is the case in some relevant applications.
We derive the detection boundary in such a situation where we allow both the
number of blocks and the block length to grow polynomially with sample size. We
derive these results both for the univariate and the multivariate settings as
well as for the problem of detecting clusters in a network. These results
recover as special cases the heterogeneous mixture detection problem [1] where
there is no structure in the signal, as well as the scan problem [2] where the
signal comprises a single interval. We develop methodology that allows optimal
adaptive detection in the general setting, thus exploiting the structure if it
is present without incurring a relevant penalty in the case where there is no
structure. The advantage of this methodology can be considerable, as in the
case of no structure the means need to increase at the rate $\sqrt{\log n}$ to
ensure detection, while the presence of structure allows detection even if the
means decrease at a polynomial rate.